Abstract: Often in sequential trials additional data become available after
a stopping boundary has been reached. A method of incorporating such in- 
formation from overrunning is developed, based on the "adding weighted Zs" 
method of combining p-values. This yields a combined p-value for the primary
test and a median-unbiased estimate and confidence bounds for the parame- 
ter under test. When the amount of overrunning information is proportional 
to the amount available upon terminating the sequential test, exact inference 
methods are provided; otherwise, approximate methods are given and eval- 
uated. The context is that of observing a Brownian motion with drift, with 
either linear stopping boundaries in continuous time or discrete-time group- 
sequential boundaries. The method is compared with other available methods 
and is exemplified with data from two sequential clinical trials. 
1. Introduction

Suppose a sequential trial is carried out to test a null hypothesis about a real
parameter $\delta$. Once the trial is concluded, a non-sequential trial is conducted, with
a test of the same hypothesis. The trials are connected in that the amount of 
information in the non-sequential trial may depend on data accumulated in the
sequential trial. How can the results of the two trials be combined, and a single 
overall test constructed? The context is that the data, or incremental information, 
in the non-sequential trial represent "overrunning," from "lagged" data from the 
sequential trial. 

T. W. Anderson [1] considered the problem of incorporating lagged data in an 
accept-reject rule following a sequential probability ratio test and proposed an (ap- 
proximate) likelihood ratio test. In the context of modern-day clinical trials, the 
problem of how to incorporate data from overrunning was raised and discussed by 
Whitehead [16, 17], and he gives an admittedly ad hoc solution, later named the 
deletion method [14]. This latter paper includes a comparison of the deletion method 
with methods described herein, under certain limited conditions. Another solution 
is presented in Hall and Liu [4] - actually, an extension of Anderson's likelihood 
ratio method - along with a discussion of the possible structure of overrunning in- 
formation in a sequential clinical trial. However, this solution utilizes the maximum- 
likelihood ordering of the sample space, requiring specification of the details of the 
stopping rule beyond the time a stopping boundary was first reached, in contrast 
to stagewise ordering. In this paper, we focus on procedures that do not require 
such specification. See these references for further introductory material. 

In the context of monitoring a Brownian motion with drift by periodic observa-
tions - the context considered herein - Whitehead [16] proposes treating the final
analysis that incorporates the overrunning data as if it were a scheduled analysis, 
but ignoring the analysis that led to stopping, and hence involving a deletion. He 
uses a stepwise ordering (as defined in [6], for example) for computing p-values and
carrying out further inference. 

Here we provide another solution, based on the "adding weighted Zs" method of 
combining p-values (Stouffer et al. [15], Mosteller and Bush [10], Liptak [7]); one
p-value is derived from the sequential experiment (without the overrunning) and the
other is based solely on the incremental overrunning data. We recommend weighting
the two p-values using observed information. This is fully legitimate only if (i) the
amount of information in the non-sequential trial (overrunning) is proportional to 
that available at termination of the sequential trial or (ii) the sequential trial was 
actually nonsequential (and test statistics are normally distributed). For discussion
of (i), see [4], Section 2. 

Another issue that arises in popular group-sequential trials is that if stopping 
does not occur until the last scheduled analysis, such an analysis will ordinarily not 
be done until lagged data are available, in which case a p-value will be computed 
by standard group-sequential methods with a re-scheduled final analysis. (This is 
consistent with the deletion method.) A modification of our method, which combines 
p-values for such trials only when stopping early, is evaluated numerically.

Another application of the combination method could be to a double sampling 
study in which the second sample size depends on the outcome - e.g., on the ob-
served variability - of the first sample. These methods are also appropriate for a
meta-analysis of two (or more) experiments, whether sequential or not. 

Brannath, Posch and Bauer [2] proposed p-value combination rules in a differ-
ent context, namely that of adaptive group-sequential sampling. In their setting,
allowance is made for the possibility of not carrying out the second stage. (In our
context, this would constitute "preventing overrunning.") If the second stage is
carried out, the two p-values are combined in a way that (i) preserves an overall
significance level and (ii) recognizes the stopping rule. As here, the second stage
p-values may be conditional on results from the first stage. They extend to multiple
stages recursively. Numerical integration may be required.

The "adding weighted Zs" combination method is described in Section 2 and 
extended to an ordered sequence of possibly dependent experiments. In Section 3, 
this method is applied to sequential clinical trials with overrunning. Special atten- 
tion is given to the case of a constant amount of overrunning information, the case 
considered in [14], or to an amount proportional to the amount available at the end
of the sequential trial. It is shown that the latter assumption justifies the use of 
weights related to the observed (and hence random) amounts of information. Oth- 
erwise, the use of such random weights leads to a null distribution of the p-value 
which is only approximately uniform. Still, we recommend this usage so long as the 
approximation is adequate. 

In Section 4 we show how to use the combination p-value method to compute 
estimates and confidence intervals, and in Section 5 provide formulas for evaluating 
the true confidence coefficients associated with these methods, thus enabling an 
evaluation of approximations noted above. Some evaluations are summarized in 
Section 6. 

In Sooriyarachchi et al. [14], the issue of reversals in the conclusions after in-
corporating overrunning, from rejection to acceptance of a null hypothesis or vice
versa, was raised. They found, in the cases treated numerically there, that both the
deletion method and the combination method might lead to an uncomfortable level
of reversals, with the deletion method doing so less frequently. They also noted that
both methods (in cases treated) sometimes lead to reduced power. We consider 
these issues in Section 7 and indicate a modification of the combination method 
that reduces these effects. 

The methods are applied to data from the MADIT trial [8] in Section 8 - for
both the actual linear-boundary design and for an imagined group-sequential ver-
sion. Results are compared with those from the deletion and ML-ordering methods.
Results from a second example [9] are briefly summarized. 

Some final comments appear in Section 9, including a summary comparison of 
the alternative methods for incorporating overrunning. 

2. Combining p-values by adding weighted Zs and an extension 

We suppose some potential data $X$ are to be available for testing a null hypothesis
about a real parameter $\delta$ belonging to an interval $\Delta$. For each $\delta_0 \in \Delta$, we consider a
test of $\delta = \delta_0$ versus $\delta > \delta_0$, with p-value $p(x; \delta_0)$ when $X = x$ is observed. Suppose,
for each $\delta_0$, $P = p(X; \delta_0)$ is uniformly distributed on $(0, 1)$ when $\delta = \delta_0$ and that,
for each $x$, $p(x; \delta)$ is increasing in $\delta$. Then $\mathcal{P} = \{p(\cdot\,; \delta_0) \mid \delta_0 \in \Delta\}$ defines a proper
family of p-values for this testing problem.

This is an overly strict definition. We have restricted attention to test func- 
tions with continuous distributions, and stochastic ordering (increasing or non- 
decreasing) of p-values would allow for differing sample spaces, but the conditions 
given meet our application. We usually omit the word "proper." 

The ordering is needed to avoid possible inconsistencies. Data considered to 
be "more extreme" than that observed should have higher probability under an 
alternative hypothesis than under the null. Moreover, it facilitates construction of 
consistently defined confidence bounds. Simply equate the p-value for testing $\delta_0$
to $\gamma$ ($1 - \gamma$, resp.) and solve for $\delta_0$ to obtain a lower (upper, resp.) confidence
bound with confidence coefficient $1 - \gamma$. Choosing $\gamma = 0.5$ yields a median-unbiased
estimate.
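
The inversion just described can be sketched numerically. The example below is only an illustration under simplifying assumptions that are not part of the paper: a fixed-sample test based on the mean `xbar` of `n` unit-variance normal observations stands in for the test, and the helper names `p_value` and `solve_for_delta0` are hypothetical. Because the p-value is increasing in the null value, a simple bisection finds the null value at which it equals a target level.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(xbar, n, delta0):
    """One-sided p-value for delta = delta0 versus delta > delta0.

    Assumes (for illustration only) a mean of n unit-variance normal draws.
    """
    return STD_NORMAL.cdf(-sqrt(n) * (xbar - delta0))

def solve_for_delta0(xbar, n, target, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection for the delta0 at which p_value equals target.

    Relies on p_value being increasing in delta0.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_value(xbar, n, mid) < target:
            lo = mid  # p too small: the null value must move up
        else:
            hi = mid
    return 0.5 * (lo + hi)

xbar, n, gamma = 0.4, 25, 0.025
lower = solve_for_delta0(xbar, n, gamma)            # lower bound, coefficient 1 - gamma
upper = solve_for_delta0(xbar, n, 1.0 - gamma)      # upper bound, coefficient 1 - gamma
median_unbiased = solve_for_delta0(xbar, n, 0.5)    # median-unbiased estimate
```

In this fixed-sample stand-in the three solutions reduce to the familiar normal-theory values, which makes it a convenient check before substituting a sequential p-value function.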
Suppose $p_1$ and $p_2$ are independent p-values for the same null hypothesis, and let
$z(u) = \bar\Phi^{-1}(u)$ with $\bar\Phi(z) = 1 - \Phi(z)$ and $\Phi$ being the standard normal distribution
function. Let $w_1$ and $w_2$ be positive numbers for which $w_1^2 + w_2^2 = 1$. Then

$$p = \bar\Phi\bigl(w_1 z(p_1) + w_2 z(p_2)\bigr) \tag{2.1}$$

is the adding weighted Zs combined p-value [7, 10, 12, 15]; also see [13]. It is readily
seen that the argument of $\bar\Phi$ in (2.1), with $p_i$ replaced by the random variable $P_i$, is
distributed as standard normal under the null hypothesis, and hence $P$ in (2.1) is
distributed as $U(0, 1)$. Moreover, $p$ tends to be small whenever both $p_1$ and $p_2$ are
small. More precisely, it is seen to be proper whenever $p_1$ and $p_2$ are proper - the
$P_i$'s are increasing in $\delta_0$, so the $z(p_i)$'s are decreasing, as is any positively-weighted
linear combination, and hence $p$ is increasing in $\delta_0$.

The interpretation should be clear: $z(p_i)$ is a standardized normal deviate that
corresponds to the test statistic on which $p_i$ is based (whether or not $p_i$ was based
on a normally distributed statistic), and the argument of $\bar\Phi$ in (2.1) represents a
(weighted) pooling of normal deviates for the two independent tests, with $p$ the
resulting p-value.
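
As a minimal sketch of the combination rule (2.1), the function below converts each p-value to its upper-tail normal deviate, pools the deviates with the two weights, and converts back. It assumes the weights are normalised so that their squares sum to one, which is what makes the pooled deviate standard normal under the null. The names `z_of` and `combine_weighted_z` are illustrative, not from the paper.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def z_of(p):
    """Upper-tail normal deviate z(p): the value whose right tail area is p."""
    return -_N.inv_cdf(p)

def combine_weighted_z(p1, p2, w1, w2):
    """Adding-weighted-Zs combined p-value for two one-sided p-values.

    Assumes positive weights with w1**2 + w2**2 == 1.
    """
    if w1 <= 0 or w2 <= 0 or abs(w1 * w1 + w2 * w2 - 1.0) > 1e-9:
        raise ValueError("weights must be positive with squares summing to 1")
    pooled_z = w1 * z_of(p1) + w2 * z_of(p2)
    return _N.cdf(-pooled_z)  # upper-tail area beyond the pooled deviate

# Two equally weighted p-values of 0.05 combine to something smaller than either.
equal = sqrt(0.5)
combined = combine_weighted_z(0.05, 0.05, equal, equal)
```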

This combination method may be extended to settings where $p_1$ and $p_2$ are
derived from overlapping data sets but $p_2$ is a conditional p-value for each subset
of data on which $p_1$ is based. A possible context is that a second experiment was
designed based on the outcome of the first experiment, and a conditional test was 
used in the second experiment. Formally, 

Proposition 2.1. Suppose $p_1 = p_1(x; \delta)$ and $p_2 = p_2(x, y; \delta)$, and
$\mathcal{P}_1 = \{p_1(\cdot\,; \delta_0) \mid \delta_0 \in \Delta\}$ is a family of p-values and
$\mathcal{P}_2 = \{p_2(x, \cdot\,; \delta_0) \mid \delta_0 \in \Delta\}$ is, for each $X = x$,
a family of conditional p-values. Then $\mathcal{P}_2$ is a family of unconditional p-values,
$P_1(X; \delta_0) \perp P_2(X, Y; \delta_0)$ (independent) for each $\delta_0$, and (2.1) defines a family of
p-values.

Proof. Since $P_2(X, Y; \delta_0)$ is conditionally $U(0, 1)$ for every $X$, it is unconditionally
$U(0, 1)$, and the needed monotonicity also follows. For each $\delta_0$, the joint distribution
function of $(P_1, P_2)$ is

$$\Pr\{P_1 \le u_1, P_2 \le u_2\} = E\{1(p_1(X) \le u_1) \cdot E[1(p_2(X, Y) \le u_2) \mid X]\}
= E\{1(p_1(X) \le u_1) \cdot u_2\} = u_1 u_2,$$

from which independence follows, and this is sufficient for the claim about $p$. □

Now what about the weights? Ordinarily, they might be related to sample size 
or information. Specifically, if the $p_i$'s are derived from tests based on means of $n_i$
normally distributed observations (with common variance), then a combined $p$ with
$w_i \propto \sqrt{n_i}$ would yield the same $p$ as that from a pooling of the two samples. So
far, we have only assumed the weights to be positive constants - depending neither
on $\delta_0$ nor on the data. Here are some partial extensions; examples of each appear
in the next section. 
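
The claim that weights proportional to the square roots of the sample sizes reproduce the pooled analysis can be checked numerically. The sketch below assumes unit-variance normal data and a null value of zero; the sample sizes and means are made-up illustrative numbers, and `upper_tail_p` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def upper_tail_p(mean, n, delta0=0.0):
    """One-sided p-value from the mean of n unit-variance normal observations."""
    return _N.cdf(-sqrt(n) * (mean - delta0))

# Illustrative (made-up) sample sizes and sample means.
n1, n2 = 40, 10
m1, m2 = 0.30, 0.55

p1, p2 = upper_tail_p(m1, n1), upper_tail_p(m2, n2)

# Weights proportional to sqrt(n_i), normalised so the squares sum to one.
w1, w2 = sqrt(n1 / (n1 + n2)), sqrt(n2 / (n1 + n2))
pooled_z = w1 * -_N.inv_cdf(p1) + w2 * -_N.inv_cdf(p2)
combined = _N.cdf(-pooled_z)

# p-value from simply pooling the two samples into one of size n1 + n2.
pooled_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
pooled = upper_tail_p(pooled_mean, n1 + n2)
```

The two results agree up to floating-point error because the weighted sum of deviates equals the standardized pooled mean.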

The weights may depend on $\delta_0$ without affecting the null distribution of $P$ in
(2.1), but the monotonicity in $\delta_0$ may be destroyed except for special choices.

The weights may be random (depending on $X$) without affecting the monotoni-
city in $\delta_0$, but would typically disturb the uniformity of the null distribution of $P$.

It should be emphasized that all p-values considered above are for one-sided 
alternatives. After including overrunning, the usual convention of doubling them 
for 2-sided alternatives may be appropriate. 
Finally, we note that all of this can be directly extended to an ordered set of
several p-values, each involving new data and conditional on all past data, and
combined in the "adding weighted Zs" fashion.

Specifically, let $p_k$ be a p-value for the incremental stage-$k$ data, conditional on
data from all prior stages. Then define a stage-$k$ combination p-value by replacing
the argument of $\bar\Phi$ in (2.1) by $\sum_{i=1}^{k} w_{k:i}\, z(p_i)$ with stage-$k$ weights all positive,
satisfying $\sum_{i=1}^{k} w_{k:i}^2 = 1$ and $w_{k:i}^2 = w_{k-1:i}^2 (1 - w_{k:k}^2)$ for $i < k$. Equivalently, (2.1)
may be applied recursively, replacing $p_1$ by a combined $p$ from earlier stages with
weight $w_1$ for this new $p_1$ and $w_2$ for the incremental data, with $w_1^2 + w_2^2 = 1$.
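
The equivalence between the direct multi-stage formula and recursive application of the two-stage rule can be sketched as follows. As an assumption made only for this illustration, squared weights are taken proportional to a per-stage information amount; the function names and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def z_of(p):
    """Upper-tail normal deviate for a p-value."""
    return -_N.inv_cdf(p)

def combine_direct(p_values, info):
    """Stage-k combined p-value with squared weights proportional to info."""
    total = sum(info)
    pooled_z = sum(sqrt(i / total) * z_of(p) for p, i in zip(p_values, info))
    return _N.cdf(-pooled_z)

def combine_recursive(p_values, info):
    """Apply the two-stage rule repeatedly, one new stage at a time."""
    p, accumulated = p_values[0], info[0]
    for p_k, i_k in zip(p_values[1:], info[1:]):
        w1 = sqrt(accumulated / (accumulated + i_k))  # weight for earlier stages
        w2 = sqrt(i_k / (accumulated + i_k))          # weight for the new stage
        p = _N.cdf(-(w1 * z_of(p) + w2 * z_of(p_k)))
        accumulated += i_k
    return p

stage_p = [0.03, 0.20, 0.08]      # illustrative stage-wise p-values
stage_info = [10.0, 4.0, 6.0]     # illustrative information per stage
direct = combine_direct(stage_p, stage_info)
recursive = combine_recursive(stage_p, stage_info)
```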

3. Incorporating overrunning by combining p-values 

We now assume a sequential experiment takes place, resulting in an observation of
$(T, X)$, say. We focus on the context of observing a Brownian motion $X(t)$ with
drift $\delta$, with a stopping time $T$ and $X = X(T)$ upon stopping, but other contexts
may be treated similarly. After stopping, some additional data become available,
represented by further observation of the process for $t_0 = t_0(T, X)$ units of time.
Conditional on $t_0$, a sufficient statistic for the overrunning data is the increment
$Y$ observed during the overrunning time increment $t_0$. In other words, a sequential
experiment is followed by a non-sequential one, with sample size (observation time)
depending on the outcome of the sequential trial. There may be additional random-
ness in $t_0$; it is sufficient to let $t_0(t, x)$ be the conditional expectation of overrunning
information, given $(T, X) = (t, x)$. See ([4], Section 2) for discussion supporting $t_0$
being a constant, $\propto \sqrt{t}$, or $\propto t$ as possible approximations to reality.

Upon reaching a stopping boundary, a p-value $p_1$ for a null hypothesis about the
drift parameter is defined: $\delta = \delta_0$ versus $\delta > \delta_0$. And at the end of overrunning,
a conditional p-value $p_2$ is simply $\bar\Phi((y - \delta_0 t_0)/\sqrt{t_0})$, given $t_0 = t_0(t, x)$. A com-
bination p-value is therefore given by (2.1). (Here, $(T, X)$ plays the role of $X$ in
Section 2.) Hence,

Corollary 3.1. Suppose $w_1$ and $w_2$ are positive constants for which $w_1^2 + w_2^2 = 1$.
Then

$$p(t, x, y; \delta_0) = \bar\Phi\!\left(w_1 z(p_1(t, x; \delta_0)) + w_2 \frac{y - \delta_0 t_0}{\sqrt{t_0}}\right)\Bigg|_{t_0 = t_0(t, x)}, \qquad \delta_0 \in \Delta, \tag{3.1}$$

defines a family of p-values.

But how should the weights be chosen? It is tempting to choose them to be
proportional to the square-root of information in the respective parts of the exper-
iment. Then each summand in (3.1) would have variance or conditional variance
proportional to the information in that part of the experiment. Using expected informa-
tion, $w_1^2 = E_{\delta_0}(T)/[E_{\delta_0}(T) + E_{\delta_0} t_0(T, X)]$ and $w_2^2 = 1 - w_1^2$. But, as noted in
Section 2, this would not typically preserve the needed monotonicity of $p(\delta_0)$ in
(3.1). Moreover, knowledge of the functional form of the dependence of $t_0$ on $(t, x)$
would be needed. If $t_0$ were constant, this would yield

$$p(t, x, y; \delta_0) = \bar\Phi\!\left(\frac{[E_{\delta_0}(T)]^{1/2}\, z(p_1(\delta_0)) + y - \delta_0 t_0}{[E_{\delta_0}(T) + t_0]^{1/2}}\right).$$

This could be used as a p-value for a single null hypothesis, but it would not
be suitable for construction of confidence bounds, unless $E_{\delta_0}(T)$ was replaced by
$E_{\delta_0'}(T)$ for a fixed $\delta_0'$. Because of these limitations, we abandon this approach.
Suppose instead we use the square-roots of observed information, namely $\sqrt{t}$ and
$\sqrt{t_0}$, yielding

$$p(t, x, y; \delta_0) = \bar\Phi\!\left(\frac{\sqrt{t}\, z(p_1(t, x; \delta_0)) + y - \delta_0 t_0}{\sqrt{t + t_0}}\right). \tag{3.2}$$

Monotonicity in $\delta_0$ (for each $(t, x, y)$) is maintained, but the uniformity of the null
distribution would appear to be in doubt. However, to compute $p$, no knowledge of
the dependency structure of $t_0$ is required, only its observed value.

We now consider the special case of (3.1) and (3.2) with $t_0 \propto t$, say $t_0 = ct$. Since
$w_1^2 = t/(t + ct) = 1/(1 + c)$ and $w_2^2 = c/(1 + c)$, this yields constant weights; and $c$
is known once $T = t$ and $t_0$ are observed. Hence, the use of observed information
in this case is justified.

Corollary 3.2. If, for some constant $c$, $t_0(x, t) = ct$ for all $(x, t)$, then (3.2) defines
a family of p-values.

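For the group-sequential case with up to $K$ analyses and stagewise ordering, we
modify the combination p-value (3.2): For testing $\delta_0$, with $t_{0k} = t_0(k)$, the modified
p-value is defined as

$$p(t, x, y; \delta_0) = \begin{cases} \bar\Phi\!\left(\dfrac{\sqrt{t_k}\, z(p_1(t_k, x; \delta_0)) + y - \delta_0 t_{0k}}{\sqrt{t_k + t_{0k}}}\right) & \text{if } t = t_k,\ k < K, \\[2ex] p_1(t_K^\circ, x + y; \delta_0) & \text{if } t = t_K, \end{cases} \tag{3.3}$$

where $p_1(t, x; \delta_0)$ is the group-sequential stagewise p-value for testing $\delta = \delta_0$ versus
larger values when the analyses are scheduled at $t_1, \ldots, t_{K-1}, t_K^\circ = t_K + t_{0K}$ with
early-stopping sets $S_k$ ($k < K$) (each the complement of an interval). This matches
the deletion method when stopping has not occurred early.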

For a group-sequential ML-ordering, the ML-ordering method [4] may be more
suitable.

We show in Sections 5 and 6 that use of p in (3.2) or (3.3), for several choices of 
the dependency of $t_0$ on $(t, x)$ and two popular sequential designs, for constructing
confidence bounds and intervals may yield adequately accurate confidence coeffi- 
cients. This leads us to recommend the use of (3.2) or (3.3) as if it were a bona 
fide combination p-value, if the design chosen and the likely form of dependency 
are similar to those considered in Section 6. 

One last variation permits further adjustment of the weighting: Use weights with
squares proportional to $T$ and $\rho\, t_0(T)$ for a specified weighting factor $\rho > 0$. For
motivation, see Section 7.

4. Computing p-values and confidence bounds 

Here we act as if $t_0 \propto t$, and discuss the use of (3.2) and (3.3) for obtaining p-values
and, by inversion, confidence bounds and intervals.

For any particular null value $\delta_0$, the combined p-value $p(\delta_0)$ may be computed
from (3.2) or (3.3) with $t$ (or $t_k$), $x$, $y$ and $t_0$ the observed values, and using software
that enables computation of $p_1(\delta_0)$. For general linear boundaries, such software
is available from the authors (based on formulas in [3]), and the PEST software
[11] provides such output for a limited selection of linear boundaries and group-
sequential modifications of them. For group-sequential boundaries with stagewise
ordering, a program - built around software for $p_1(\delta_0)$ from Jennison [5] - is available
from the authors.



To obtain an upper confidence bound with confidence coefficient $\gamma$, we need to
solve $p(\delta) = \gamma$ for $\delta = \delta_\gamma$, or equivalently, solve $z(p(\delta)) = z(\gamma)$. A little algebra
leads to the equivalent problem - except in the group-sequential case with $t = t_K$ -
of solving $\delta - h(\delta) - [y - \sqrt{t^\circ}\, z(\gamma)]/t_0 = 0$ where $t^\circ = t + t_0$ and $h(\delta) = \sqrt{t}\, z(p_1(\delta))/t_0$.
Starting from a trial solution $\delta^\circ$, and computing $h(\delta^\circ)$ and $h(\delta^\circ + \epsilon)$ for some small
$\epsilon$, an improved solution is

$$\delta = \delta^\circ - \frac{\delta^\circ - h(\delta^\circ) - [y - \sqrt{t^\circ}\, z(\gamma)]/t_0}{1 + [h(\delta^\circ) - h(\delta^\circ + \epsilon)]/\epsilon}.$$

We find that two or three iterations provide good accuracy. (When $t = t_K$, we only
need solve $p_1(t_K^\circ, x + y; \delta) = \gamma$.) Alternatively, a trial-and-error approach works
quite satisfactorily.
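
The iteration can be sketched in a few lines. Since no sequential p-value software is assumed here, the example uses a fixed-sample p-value as a stand-in for $p_1$ (a deliberate simplification, not the authors' program); with that stand-in the equation is linear and the answer is known in closed form, which gives a convenient check. The names `upper_bound` and `fixed_sample_p1` and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def z_of(p):
    """Upper-tail normal deviate for a p-value."""
    return -_N.inv_cdf(p)

def upper_bound(p1_func, t, t0, y, gamma, start=0.0, eps=1e-4, iters=5):
    """Newton-type iteration with a finite-difference slope.

    Solves delta - h(delta) - g = 0, where h(delta) = sqrt(t) z(p1(delta)) / t0
    and g = [y - sqrt(t + t0) z(gamma)] / t0.
    """
    g = (y - sqrt(t + t0) * z_of(gamma)) / t0

    def h(delta):
        return sqrt(t) * z_of(p1_func(delta)) / t0

    delta = start
    for _ in range(iters):
        slope = 1.0 + (h(delta) - h(delta + eps)) / eps
        delta = delta - (delta - h(delta) - g) / slope
    return delta

# Stand-in for p1: a fixed-sample p-value, NOT a sequential one (illustration only).
t, x, t0, y = 12.0, 10.0, 1.5, 3.0

def fixed_sample_p1(delta):
    return _N.cdf(-(x - delta * t) / sqrt(t))

bound = upper_bound(fixed_sample_p1, t, t0, y, gamma=0.975)
closed_form = (x + y - sqrt(t + t0) * z_of(0.975)) / (t + t0)
```

A real application would pass a function returning the sequential p-value for each trial null value in place of `fixed_sample_p1`.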

5. True confidence coefficients 

We now evaluate the true confidence coefficient for a confidence bound or interval
determined by using (3.2), whether or not $t_0$ is proportional to $T$. Let $\delta_\gamma$ be an upper
confidence bound determined by the method of the previous section for a nominal
confidence coefficient $\gamma$. The question is: what is the true confidence coefficient?
We need to evaluate, for given $\gamma$ and $\delta$, $q_\gamma(\delta) = P_\delta(\delta < \delta_\gamma)$ and determine
$q_\gamma^\circ = \inf_\delta q_\gamma(\delta)$.

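As noted in Section 4, $\delta_\gamma$ is the solution to $\delta - h(\delta) = [y - \sqrt{t^\circ}\, z(\gamma)]/t_0 =
g(y, t^\circ, t_0, \gamma)$. Since $h$ is decreasing in $\delta$, the left side is increasing in $\delta$, and hence
$\delta < \delta_\gamma$ iff $\delta - h(\delta) < g(y, t^\circ, t_0, \gamma)$. This latter event is equal to the event
$(y - \delta t_0)/\sqrt{t_0} > [-\sqrt{t}\, z(p_1(t, x; \delta)) + \sqrt{t^\circ}\, z(\gamma)]/\sqrt{t_0}$. Therefore, conditioning on $(T, X)$
and hence on $T_0$, we have

$$q_\gamma(\delta) = P_\delta(\delta < \delta_\gamma) = E_\delta\, \bar\Phi\!\left(\frac{(T^\circ)^{1/2} z(\gamma) - T^{1/2} z(P_1(T, X; \delta))}{T_0^{1/2}}\right). \tag{5.1}$$

If this combination p-value were bona fide - that is, if $T_0 \propto T$ - the result would be
$\gamma$ identically in $\delta$.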
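
The proportional case can be checked numerically. Assuming the overrunning information is $c$ times the stopping time, the argument of the upper-tail normal function in the expectation above reduces to $[\sqrt{1 + c}\, z(\gamma) - Z]/\sqrt{c}$, where $Z = z(P_1)$ is standard normal at the true drift. The sketch below integrates over $Z$ with a plain trapezoid rule; the function name and grid settings are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def q_gamma_proportional(gamma, c, grid=20001, span=10.0):
    """Expected upper-tail area when overrunning information is c times T.

    Integrates over Z ~ N(0, 1) on [-span, span] with the trapezoid rule.
    """
    z_gamma = -_N.inv_cdf(gamma)
    step = 2.0 * span / (grid - 1)
    total = 0.0
    for i in range(grid):
        z_val = -span + i * step
        end_weight = 0.5 if i in (0, grid - 1) else 1.0
        argument = (sqrt(1.0 + c) * z_gamma - z_val) / sqrt(c)
        total += end_weight * _N.cdf(-argument) * _N.pdf(z_val)
    return total * step

# Each value should reproduce its nominal level up to integration error.
checks = {(g, c): q_gamma_proportional(g, c) for g in (0.5, 0.9) for c in (0.1, 0.5)}
```

When overrunning information is not proportional to the stopping time, the expectation depends on the joint law of the stopping time and the sequential p-value, which this sketch does not attempt to model.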

The true confidence coefficient for an (equal-tail) confidence interval based on
(3.2) may be obtained similarly. For an interval with nominal confidence coefficient
$\gamma$, the true confidence coefficient is $Q_\gamma^\circ = \inf_\delta Q_\gamma(\delta)$ where

$$Q_\gamma(\delta) = q_{(1+\gamma)/2}(\delta) - q_{(1-\gamma)/2}(\delta). \tag{5.2}$$

For the group-sequential modification (3.3), (5.1) needs to be modified when $T = t_K$.

6. Computational support for approximations

Here we report on some numerical evaluations of the validity of using (3.2) or (3.3)
when $t_0$ is not proportional to the observed stopping time $t$, and the validity of
using the combination method only when stopping after an interim analysis. For
various special cases and many values of $\delta$, we computed (5.1) for $\gamma = 0.5$ and (5.2)
for $\gamma = 0.9$ and 0.95 to see how close they are to the respective nominal values of
0.5, 0.9 and 0.95. We summarize some of the findings here.
A linear-boundary design: Consider triangular boundaries for testing $\delta = 0$ versus
$\delta = 1$ with intercepts $\pm 5.99$, slopes 0.75 and 0.25, and apex at $t = 23.97$. This
design has both error probabilities 0.025. The design may be adapted for testing
$\delta = 0$ versus $\delta \ne 0$ (as prescribed by the PEST software). The resulting one-sided
rejection region is the upper boundary for which the power at $\delta_1 = 0.8233$ is 0.9.
The expected stopping time is 7.776 at $\delta = 0$ (or 1) and 9.382 at $\delta_1$, and has its
maximum of 11.217 at $\delta = 0.5$.

We considered $t_0 \propto T$, $t_0$ constant and $t_0 \propto \sqrt{T}$. For the first case, we simply
verified the accuracy of our computer program, finding that the distribution of the
p-value was exactly uniform, and that the true confidence coefficients matched the
nominal ones exactly.

For the constant case, we considered $t_0 = c E_{\delta_1}(T)$ with $c$ ranging from 0.1 to
0.5. Here are selected results:

          c = 0.1                      c = 0.5
  0.487 < q_.5  < 0.513        0.471 < q_.5  < 0.529
  0.900 < Q_.9  < 0.908        0.899 < Q_.9  < 0.917
  0.950 < Q_.95 < 0.955        0.949 < Q_.95 < 0.960

For $c = 0.1$, the true confidence coefficients $Q^\circ$ for nominal 90% and 95% confidence
intervals are therefore correct (to 3 decimal places), and only slightly below the
nominal values for $c = 0.5$. However, the median-unbiased estimate may have a
few percentage points of median-bias, depending on the true $\delta$. We also found that
$q_{.5} < 0.5$ for $\delta > 0.5$ and vice versa. Computations for $c$-values between 0.1 and
0.5 yielded bounds between the respective ones in the display above. Results for
$t_0 \propto \sqrt{T}$ were uniformly better than those for $t_0$ constant.

An O'Brien-Fleming group-sequential design: Consider an O'Brien-Fleming two-
sided design for testing $\delta = 0$ with significance level 0.05 and power 0.9 at $\delta = \pm 1$,
with a maximum of 5 analyses. We assume equally spaced interim analyses, at 0.2,
0.4, 0.6, 0.8 times $t_5 = 10.781$, with boundary values of $\pm 6.6988$ (obtained from [6]).
Again, we considered $t_0 \propto T$ for the unmodified combination p-value to confirm the
accuracy of our programs. For the modified $p$, we considered $t_0$ constant, namely
$t_0 = c\, t_5$, and $t_0 = c\, t_k$; in each case, $c$ ranged from 0.02 to 0.1.

Here are some of the results: 

                t_0 = c t_5               t_0 = c t_k
            c = 0.02   c = 0.1        c = 0.02   c = 0.1
  q°_.5      0.475      0.445          0.478      0.451
  Q°_.9      0.894      0.887          0.894      0.888
  Q°_.95     0.947      0.943          0.947      0.943

Again, although $q_{.5}^\circ$ may be as small as 0.44 (and by symmetry $0.44 < q_{.5}(\delta) < 0.56$),
we found that $q_{.5}$ was usually within $\pm 0.01$ of 0.5. Indeed, this occurred for all but
1%, 7%, 1% and 4%, respectively (reading from left to right in the display above)
of the range of $\delta$-values within $\pm 2.5$.



7. Reversals and power 



Sooriyarachchi et al. [14] raised concern about the frequency of reversals of ac- 
ceptance and rejection conclusions after inclusion of overrunning information, but 
stressed their desire not to ignore such information. In simulation studies of the
deletion and combined p-value methods, with constant amounts of lagged data (in- 
dependent of the results at the time of stopping), they found levels of reversals that 
they considered worrisome, especially for the combination method - perhaps 3 or 4 
percent. However, in popular group-sequential designs such as O'Brien-Fleming,
reversals were rare and only defined when the trial stopped early, as an analysis at 
a final scheduled time would ordinarily await lagged data before execution. 

Of more concern to us is their finding that both methods may lead to reductions
in power. Intuitively, when a rejection occurs "early", overrunning can reverse it but
the chances of compensating with reversals in the other direction may be minimal. 

With constant overrunning information, our computations (not reported here) 
confirm theirs, but we find reversals to be somewhat less frequent when overrunning 
information increases with stopping times, and losses in power are then rarer. 

A possible compromise method is as follows: down-weight the overrunning p-
value in the combination formula. By introducing a factor $\rho$ (see end of Section
3), it is possible to maintain power and depress the frequency of reversals but
still not ignore the lagged data completely. However, computations show that some
situations will require extensive down-weighting (small $\rho$). Choice of a suitable $\rho$ will
require computational trial-and-error, with assumptions about overrunning needed.
For this purpose, we provide the following formulas. 

When the true drift is $\delta$ (and stopping is in continuous time), the probability
of rejection upon stopping followed by acceptance after inclusion of overrunning,
when $t_0 = ct$, is

$$P_\delta(R \to A) = \int_0^\infty \Phi\!\left\{\left[\frac{1 + \rho c}{\rho c}\right]^{1/2} z_\alpha - \left[\frac{1}{\rho c}\right]^{1/2} z_1(U, t) - \delta (ct)^{1/2}\right\} d\psi_\delta^U(t) \tag{7.1}$$

with $z_1(U, t)$ being the standard normal deviate for which the right-hand-side tail
area beyond it is $P_0(t)$ (the p-value when the upper boundary $U$ is crossed at time
$t$), $\psi_\delta^U(t)$ being the probability of crossing the upper boundary before the lower one
prior to $t$, and $\alpha$ being the one-sided significance level for testing $\delta = 0$. For group-
sequential tests, the integrator in (7.1) is $d\psi_\delta^U(x, t)$, indicating a need to integrate
over $x$-values where $t = t_k$ and the upper boundary has been reached, but $t$ may
be restricted to $\{t_k \mid k < K\}$ since reversals at a final analysis have no role.

Similarly, $P_\delta(A \to R)$ is given by (7.1) with $U$ replaced by $L$ (for lower boundary)
and $\Phi$ replaced by $\bar\Phi$. Finally, the power after inclusion of overrunning, when the
power of the original design is $pow(\delta)$, is

$$ovpow(\delta) = pow(\delta) - P_\delta(R \to A) + P_\delta(A \to R).$$

(Software is available from the authors.) 

8. An example: the MADIT study 

MADIT (Multicenter Automatic Defibrillator Implantation Trial [8]) was a ran- 
domized clinical trial conducted to evaluate the effectiveness of an implanted de- 
fibrillator compared with conventional drug therapy to reduce mortality associated 
with ventricular arrhythmias. Monitoring was based on the logrank statistic plot- 
ted against its estimated variance [17]. This behaves like a Brownian motion with 
drift $\delta = -\log(HR)$ where $HR$ is the hazard ratio of the treatment-to-control arms
(assuming proportional hazards). The essential features were reviewed in [4] and 
are summarized here. 
A triangular design was used that assures a two-sided significance level of 5%
and a power of 90% at a hazard ratio of 0.537 (drift = 0.6218). Monitoring was
carried out weekly over the five years of the trial, thereby yielding nearly-continuous
observation of the logrank process. The stopping boundaries were $u_t = 7.935 + 0.189t$
and $l_t = -7.935 + 0.566t$, with the early part of the lower boundary ($l_t$) a rejection
region for superiority of the control arm.

Interpolating, the upper boundary was reached at $t = 12.145$ with $x = 10.230$,
later corrected to $t = 12.037$ and $x = 10.210$. The incremental coordinates for
overrunning were $t_0 = 1.240$ and $y = 2.957$, showing an upturn in the sample path
after reaching the boundary.

Respective p-values and estimates of the drift and of the HR are presented below,
contrasting results of analyses without and with the use of the overrunning data.
Values in square brackets are those reported in [4] for the ML-ordering method,
assuming $t_0 \propto \sqrt{t}$; with linear boundaries and no overrunning, stepwise ordering
and ML-ordering are identical.

overrunning   2-sided p          med-unb-est      95% confidence interval

Inference about the drift δ
without       0.0084             0.786            (0.204, 1.361)
with          0.0009 [0.0029]    0.938 [0.939]    (0.388, 1.484) [(0.329, 1.543)]

Inference about the hazard ratio HR = exp(−δ)
without       0.0084             0.456            (0.256, 0.815+)
with          0.0009 [0.0029]    0.391 [0.391]    (0.227, 0.678) [(0.214, 0.720)]



Both methods reflect the upturn in the sample path during overrunning as the
"with" p-values are smaller and the estimates farther from the null values. But the
combination method gives the smaller p-value and narrower confidence intervals; 
this may reflect the different orderings being used by the two methods. 
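
The "with overrunning" p-value for the null drift of zero can be roughly reproduced from the reported quantities. The sketch below makes two assumptions not stated in this form in the text: the one-sided sequential p-value is taken as half of the reported two-sided 0.0084, and the combined two-sided value is obtained by doubling. It uses the corrected stopping coordinates and the observed-information weights.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

# Reported MADIT quantities (corrected stopping point and overrunning increment).
t, t0, y = 12.037, 1.240, 2.957
p1_one_sided = 0.0084 / 2.0   # assumption: half the reported two-sided value

z1 = -_N.inv_cdf(p1_one_sided)              # deviate for the sequential part
pooled_z = (sqrt(t) * z1 + y) / sqrt(t + t0)  # null drift 0, so y needs no centering
p_combined_one_sided = _N.cdf(-pooled_z)
p_combined_two_sided = 2.0 * p_combined_one_sided
```

Under these assumptions the two-sided result lands close to the tabulated 0.0009; small differences are to be expected because the inputs are rounded.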

Values reported in [8] were based on Whitehead's deletion method; they are
identical to the "without overrunning" values in the display above, as the deletion
method essentially ignores overrunning when the path continues in a similar direc-
tion and there is near-continuous monitoring. (It was this observation that inspired
the development of alternative methods for incorporating overrunning.) In such set-
tings, the deletion-method p-value cannot be smaller than when computed upon first
hitting an upper boundary, irrespective of the nature of the overrunning data. (For,
when reaching the upper boundary at time $t$, with the prior analysis a short time
earlier, at time $t^-$ say, and then overrunning to $x^\circ$ at a later time $t^\circ$, the deletion one-
sided $p$ is the null probability of $\{T < t^- \text{ and } X(T) > u_T\} \cup \{T > t \text{ and } X(t^\circ) > x^\circ\}$.
These two events are disjoint, and the former is virtually the extremal set without
overrunning.)

We now consider a group-sequential variation on MADIT as described in [4]. We
pretended that an O'Brien-Fleming 5-analysis design was used for testing $\delta = 0$
versus $\delta \ne 0$ with power 80% at a HR of 0.537. It would have stopped at the third
interim analysis with the results obtained upon hitting the boundary in MADIT.
Results of analyses are reported below; for comparison, values in square brackets
are those reported in Table 2 of [4] using the ML-ordering method and assuming
$t_0 \propto \sqrt{t}$. Values for the deletion method - which treats the analysis after overrunning
as a replacement for the third scheduled analysis - are also given.

In each case - i.e., without or with overrunning - results from the group-sequential
combination method indicate a more significant departure from the null value of
HR = 1 than do those by the group-sequential ML-ordering method. At least for
the "without" results, this is attributable to the different orderings used. This time
the deletion method gives results similar to those from ML-ordering. (These results
are not directly comparable to those in the previous table since the pretended
group-sequential design has reduced power.)

overrunning        2-sided p          med-unb-est of HR   95% confidence interval

Group-sequential inference about the hazard ratio
without            0.0039 [0.0041]    0.431 [0.468]       (0.244, 0.762) [(0.295, 0.775)]
with               0.0004 [0.0011]    0.373 [0.384]       (0.217, 0.641) [(0.221, 0.672)]
Deletion method    0.0014             0.384               (0.221, 0.680)



Here is a brief summary of results from a second defibrillator trial, MADIT-II [9], 
in which the combination method was pre-specified. The design was again triangular, 
with a 5% 2-sided significance level and power 95% at a HR of 0.627: 
$u_t = 11.77 + 0.1273\,t$ and $l_t = -11.77 + 0.3819\,t$. This time 
$(t, x, t_o, y) = (45.415, 17.551, 0.483, 1.441)$. The results were: 

Inference about the hazard ratio in MADIT-II 

overrunning   2-sided p       median-unbiased   95% confidence interval 
                              estimate of HR 
without       0.028           0.708             (0.525, 0.962) 
with          0.016 [0.023]   0.688 [0.689]     (0.511, 0.932) [(0.504, 0.948)] 



Again, the deletion method of incorporating overrunning would have agreed with the 
"without" analysis, and results from the ML-ordering method (in square brackets) 
are mainly intermediate. 
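
As a quick arithmetic check on the MADIT-II design quoted above, the following 
sketch evaluates the two triangular boundaries at the reported stopping time and 
confirms that the reported x lies on the upper boundary, up to rounding. The function 
and variable names are ours, chosen for illustration; only the boundary constants and 
the reported values of (t, x) come from the text. 

```python
# Triangular-test boundaries for MADIT-II, with constants as quoted in the text.
# Function and variable names are illustrative, not taken from the paper.

def upper(t: float) -> float:
    """Upper boundary u_t = 11.77 + 0.1273 t."""
    return 11.77 + 0.1273 * t


def lower(t: float) -> float:
    """Lower boundary l_t = -11.77 + 0.3819 t."""
    return -11.77 + 0.3819 * t


t_stop, x_stop = 45.415, 17.551  # reported (t, x) at stopping

# The reported x should sit on the upper boundary, up to rounding error.
gap = x_stop - upper(t_stop)

# The two lines meet where u_t = l_t; this is the apex that closes the triangle.
t_apex = (11.77 + 11.77) / (0.3819 - 0.1273)

print(f"u_t at stopping = {upper(t_stop):.4f}, x - u_t = {gap:+.4f}")
print(f"l_t at stopping = {lower(t_stop):.4f}")
print(f"boundaries meet at t = {t_apex:.2f}")
```

The check only confirms internal consistency of the quoted numbers; it says nothing 
about the p-values or estimates in the table, which require the sequential 
distribution theory developed earlier in the paper. 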

9. Final remarks 

Proposition 1 applies to other methods of combining p-values, such as Fisher's 
summing of $-\log(1 - p_i)$. (For a description of such methods, see [12] or [13].) 
We chose the "adding Zs" method for two reasons: (i) It lends itself naturally to 
weights - it would be unreasonable to give equal weights to a long trial and a small 
amount of overrunning - and (ii) it reduces to standard normal-theory methods 
when the sequential component is replaced by a non-sequential one - equivalently, 
if a naive analysis is done after stopping rather than one recognizing the stopping 
rule. 
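
Point (ii) can be illustrated numerically. The sketch below implements a generic 
weighted-Z combination, $Z = (w_1 Z_1 + w_2 Z_2)/\sqrt{w_1^2 + w_2^2}$, for two 
independent one-sided p-values, and then checks that with the naive (fixed-sample) 
Z-scores $x/\sqrt{t}$ and $y/\sqrt{t_o}$ and weights $\sqrt{t}$ and $\sqrt{t_o}$ it 
reproduces the pooled statistic $(x + y)/\sqrt{t + t_o}$. The square-root-of-information 
weights are an assumption made for this illustration, the naive Z-score stands in 
for the sequential p-value purely to show the reduction, and the function names are 
ours. The MADIT-II values of $(t, x, t_o, y)$ serve only as convenient inputs. 

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def combine_weighted_z(p1, p2, w1, w2):
    """Combine two independent one-sided p-values by adding weighted Zs.

    Returns (combined one-sided p-value, combined Z). Illustrative helper,
    not the paper's code.
    """
    z1 = STD_NORMAL.inv_cdf(1.0 - p1)
    z2 = STD_NORMAL.inv_cdf(1.0 - p2)
    z = (w1 * z1 + w2 * z2) / sqrt(w1 ** 2 + w2 ** 2)
    return 1.0 - STD_NORMAL.cdf(z), z


# MADIT-II values quoted in the text, used here only as example inputs.
t, x, t_o, y = 45.415, 17.551, 0.483, 1.441

# Naive p-values that ignore the stopping rule (assumption for this demo).
p_naive = 1.0 - STD_NORMAL.cdf(x / sqrt(t))
p_over = 1.0 - STD_NORMAL.cdf(y / sqrt(t_o))

# Weights sqrt(t) and sqrt(t_o) are assumed, not taken from the text.
p_comb, z_comb = combine_weighted_z(p_naive, p_over, sqrt(t), sqrt(t_o))

# Standard normal-theory statistic from pooling both stages.
z_pooled = (x + y) / sqrt(t + t_o)

print(f"combined Z = {z_comb:.4f}, pooled Z = {z_pooled:.4f}")
```

In the method proper, the first p-value would be the one that recognizes the 
stopping rule, so the combined result would differ from the pooled naive analysis. 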

Here are some of the pros and cons of various methods for incorporating over- 
running: 

(a) Deletion method: Not suitable for near-continuous monitoring. Ignores the 
fact that, at the boundary-hitting stage, the monitoring statistic was in a stopping 
region but is the natural approach in a group-sequential trial when early stopping 
has not occurred. Simple to use. Results in approximate p-values and final inference. 
Limited computations show that a loss in power may occur. 

(b) Combination p-value method: Makes direct use of the analysis that led to 
stopping. Approximate except when $t_o \propto T$, and approximate even then for 
common group-sequential designs. Uses stage-wise ordering, and hence is free of any 
direct dependence on future stopping boundaries. Needs no formal assumption about 
the form of $t_o$. Computations show a loss in power may occur. May be modified to 
reduce the chance of reversal after overrunning and loss in power. 

(c) ML-ordering method: Based on a minimal sufficient statistic, and hence 
ignores which boundary was first reached and when. Exact, up to needed assumptions 
about overrunning information (and Brownian motion approximation). Requires an 
assumption about the form of $t_o(t)$, but not very sensitive to it in the practical cases 
examined. Uses ML-ordering and hence depends on stopping boundaries beyond 
those when boundaries were first reached. 

Sooriyarachchi et al. [14] conclude that (a) is preferable, although they only 
considered constant amounts of overrunning, whereas our focus has been on settings 
where the amount of overrunning information is likely to increase with increased 
stopping times. They strongly emphasize the possibility of reversals, but such 
possibilities cannot be avoided once one agrees to utilize lagged data. The chances 
can be reduced within the combination method by reducing the weight given to lagged 
data, but this would need to be considered in advance of the trial. 
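
The effect of such down-weighting can be seen in a small sketch. Here the 
overrunning weight is multiplied by a factor c between 0 and 1 that is fixed in 
advance; c = 1 gives the full weight and c = 0 ignores the lagged data altogether. 
This parametrization, the square-root-of-information base weights, and the input 
p-values are all hypothetical, chosen only to show that an unfavourable overrun 
moves the combined p-value less as c shrinks. 

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def combined_p(p_seq, p_over, t, t_o, c=1.0):
    """One-sided combined p-value with the overrunning weight scaled by c.

    p_seq  : one-sided p-value from the sequential stage
    p_over : one-sided p-value from the overrunning data alone
    t, t_o : information at stopping and from overrunning
    c      : pre-specified down-weighting factor in [0, 1] (hypothetical device)
    """
    w_seq = sqrt(t)
    w_over = c * sqrt(t_o)
    z_seq = STD_NORMAL.inv_cdf(1.0 - p_seq)
    z_over = STD_NORMAL.inv_cdf(1.0 - p_over)
    z = (w_seq * z_seq + w_over * z_over) / sqrt(w_seq ** 2 + w_over ** 2)
    return 1.0 - STD_NORMAL.cdf(z)


# Hypothetical inputs: a significant sequential stage followed by
# overrunning data that point the other way.
p_seq, p_over, t, t_o = 0.02, 0.80, 40.0, 8.0

for c in (1.0, 0.5, 0.0):
    p = combined_p(p_seq, p_over, t, t_o, c)
    print(f"c = {c:.1f}: combined one-sided p = {p:.4f}")
```

Because the weights enter the null distribution of the combined statistic, c has to 
be chosen before the data are seen; picking it afterwards would invalidate the 
p-value. 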

We recommend (c) in settings where the design is likely to be followed closely. 
A numerical study of reversals and power with the ML-ordering method will be 
presented elsewhere. Otherwise, we think the combination p-value method, possibly 
with a down-weighting of overrunning information, is a competitor worthy of con- 
sideration, especially when overrunning increases with increasing stopping times. 

We encourage investigation of the combination method in other settings, includ- 
ing meta-analyses and double sampling. 

Acknowledgments. The first MADIT trial stimulated the research reported 
here. We thank Arthur Moss, MD, and Boston Scientific Corporation (formerly 
CPI/Guidant), respectively, for leadership and sponsorship of this trial. We are 
also grateful to John Whitehead for helpful discussion and to Michael McDermott 
for some references to the p-value literature. 



References 



[1] Anderson, T. W. (1964). Sequential analysis with delayed observations. 
J. Amer. Statist. Assoc. 59 1006-1015. MR0175262 

[2] Brannath, W., Posch, M. and Bauer, P. (2002). Recursive combination 
tests. J. Amer. Statist. Assoc. 97 236-244. MR1947283 

[3] Hall, W. J. (1997). The distribution of Brownian motion on linear stopping 
boundaries. Sequential Analysis 16 345-352. Addendum in Sequential Analysis 
17 123-124. MR1491641 

[4] Hall, W. J. and Liu, A. (2002). Sequential tests and estimators after 
overrunning based on maximum-likelihood ordering. Biometrika 89 699-707. 
MR1929173 

[5] Jennison, C. (1999). Group sequential software at website: 
http://www.bath.ac.uk/mascj/book/programs/general. 

[6] Jennison, C. and Turnbull, B. W. (2000). Group Sequential Methods 
with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL. 
MR1710781 

[7] Liptak, T. (1958). On the combination of independent tests. Magyar Tud. 
Akad. Mat. Kutato Int. Kozl. 3 171-197. 

[8] Moss, A. J., Hall, W. J., Cannom, D. S., Daubert, J. P., Higgins, M. D., 
Klein, H., Levine, J. H., Saksena, S., Waldo, A. L., Wilber, D., Brown, M. W., 
Heo, M.; for the Multicenter Automatic Defibrillator Implantation Trial 
Investigators (1996). Improved survival with an implanted defibrillator in 
patients with coronary disease at high risk for ventricular arrhythmia. 
New England Journal of Medicine 335 1933-1940. 

[9] Moss, A. J., Zareba, W., Hall, W. J., Klein, H., Wilber, D. J., 
Cannom, D. S., Daubert, J. P., Higgins, S. L., Brown, M. W., Andrews, M. L.; 
for the Multicenter Automatic Defibrillator Implantation Trial-II 
Investigators (2002). Prophylactic implantation of a defibrillator in patients 
with myocardial infarction and reduced ejection fraction. New England 
J. Medicine 346 877-883. 

[10] Mosteller, F. M. and Bush, R. R. (1954). Selected quantitative techniques. 
In Handbook of Social Psychology I. Theory and Methods (G. Lindzey, ed.). 
Addison-Wesley, Cambridge, MA. 

[11] MPS Research Unit (2000). PEST: Planning and Evaluation of Sequential 
Trials, Version 4: Operating Manual. University of Reading, Reading, UK. 

[12] Oosterhoff, J. (1969). Combination of One-Sided Statistical Tests. The 
Mathematical Centre, Amsterdam. MR0247707 

[13] Rosenthal, R. (1978). Combining results of independent studies. Psych. Bull. 
85 185-193. 

[14] Sooriyarachchi, M. R., Whitehead, J., Matsushita, T., Bolland, K., 
and Whitehead, A. (2003). Incorporating data received after a sequential 
trial has stopped into the final analysis: Implementation and comparison of 
methods. Biometrics 59 701-709. MR2004276 

[15] Stouffer, S. A., Suchman, E. A., DeVinner, L. C., Star, R. M., 
Williams, R. M. (1949). The American Soldier: Adjustment During Army 
Life I. Princeton Univ. Press, Princeton, NJ. 

[16] Whitehead, J. (1992). Overrunning and underrunning in sequential clinical 
trials. Controlled Clinical Trials 13 106-121. 

[17] Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials, 
2nd ed. revised. Wiley, New York. MR0793018 



